old policy
LearningtoConstrainPolicyOptimizationwith VirtualTrustRegion
ComparedtoDeepQ-learning,deeppolicygradient (PG) methods are often more flexible and applicable to discrete and continuous action problems. However, these methods tend to suffer from high sample complexity and training instability since the gradient may not accurately reflect the policy gain when the policy changes substantially [6].
Learning to Constrain Policy Optimization with Virtual Trust Region
We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence
Group Relative Policy Optimization (GRPO), recently proposed by DeepSeek, is a critic-free reinforcement learning algorithm for fine tuning large language models. It replaces the value function in Proximal Policy Optimization (PPO) with group normalized rewards, while retaining PPO style token level importance sampling based on an old policy. We show that GRPO update rule in fact estimates the policy gradient at the old policy rather than the current one. However, since the old policy is refreshed every few steps, the discrepancy between the two remains small limiting the impact of this bias in practice. We validate this through an ablation study in which importance sampling is entirely removed, and updates are instead performed using the gradient estimated at a fixed old policy across multiple optimization steps. Remarkably, this simplification results in performance comparable to standard GRPO. Motivated by these findings, we propose a new algorithm: Trajectory level Importance Corrected GRPO (TIC GRPO). TIC GRPO replaces token level importance ratios with a single trajectory level probability ratio, yielding an unbiased estimate of the current policy gradient while preserving the critic free structure. Furthermore, we present the first theoretical convergence analysis for GRPO style methods, covering both the original GRPO and our proposed variant.
LAPD allowed to use drones as 'first responders' under new program
Citing successes other police departments across the country have seen using drones, the Los Angeles Police Commission said it would allow the LAPD to deploy unmanned aircraft on routine emergency calls. The civilian oversight body approved an updated policy Tuesday allowing drones to be used in more situations, including "calls for service." The new guidelines listed other scenarios for future drone use -- "high-risk incident, investigative purpose, large-scale event, natural disaster" -- and transferred their command from the Air Support Division to the Office of Special Operations. Previously, the department's nine drones were restricted to a narrow set of dangerous situations, most involving barricaded suspects or explosives. Bryan Lium told commissioners the technology offers responding officers and their supervisors crucial, real-time information about what type of threats they might encounter while responding to an emergency.
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
Group Relative Policy Optimization (GRPO) was introduced and used successfully to train DeepSeek R1 models for promoting reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_{n}$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_{n}$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\beta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning
Ma, Tao, Yang, Xuzhi, Szabo, Zoltan
Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost (examples include the shifting of the currently applied educational technology, modernization of a computing cluster, and the introduction of a new webpage design), and in the decision one is limited to using historical data without the availability for further online interaction. Despite the inevitable importance of this offline learning scenario, to our best knowledge, very little effort has been made to tackle the key problem of balancing between the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initialize the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple benchmarks of the Gymnasium.
Adaptive Proximal Policy Optimization with Upper Confidence Bound
Zhang, Ziqi, Xu, Jingzehua, Zhuang, Zifeng, Liu, Jinxin, wang, Donglin
Trust Region Policy Optimization (TRPO) attractively optimizes the policy while constraining the update of the new policy within a trust region, ensuring the stability and monotonic optimization. Building on the theoretical guarantees of trust region optimization, Proximal Policy Optimization (PPO) successfully enhances the algorithm's sample efficiency and reduces deployment complexity by confining the update of the new and old policies within a surrogate trust region. However, this approach is limited by the fixed setting of surrogate trust region and is not sufficiently adaptive, because there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process, truncating the ratio of the new and old policies within surrogate trust region can ensure that the algorithm achieves its best performance, therefore, exploring and researching a dynamic clip bound for improving PPO's performance can be quite beneficial. To design an adaptive clipped trust region and explore the dynamic clip bound's impact on the performance of PPO, we introduce an adaptive PPO-CLIP (Adaptive-PPO) method that dynamically explores and exploits the clip bound using a bandit during the online training process. Furthermore, ample experiments will initially demonstrate that our Adaptive-PPO exhibits sample efficiency and performance compared to PPO-CLIP.
Learning to Constrain Policy Optimization with Virtual Trust Region
Le, Hung, George, Thommen Karimpanal, Abdolshah, Majid, Nguyen, Dung, Do, Kien, Gupta, Sunil, Venkatesh, Svetha
We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
Policy Optimizations: TRPO/PPO
In this post, I will be talking about policy optimization methods from the papers Trust Region Policy Optimization (Schulman et al. 2015) and Proximal Policy Optimization Algorithms (Schulman et al. 2017). I will then briefly go over the Trust Region Policy Optimization method and two types of Proximal Policy Optimization methods: adaptive KL (Kullback-Leibler) penalties to the surrogate objective and clipped surrogate objective. In a traditional policy gradient method, we sample a trajectory of states, actions, and rewards, then update the policy using the sampled trajectories. While this method is great and solves basic control problems, the algorithm tends to be unstable and is inconsistent in solving an environment. A problem is that as we are updating the policy, the distribution of the inputs and outputs of the approximated policy distribution will change, resulting in instability.
Policy Optimizations: TRPO/PPO
In this post, I will be talking about policy optimization methods from the papers Trust Region Policy Optimization (Schulman et al. 2015) and Proximal Policy Optimization Algorithms (Schulman et al. 2017). I will then briefly go over the Trust Region Policy Optimization method and two types of Proximal Policy Optimization methods: adaptive KL (Kullback-Leibler) penalties to the surrogate objective and clipped surrogate objective. In a traditional policy gradient method, we sample a trajectory of states, actions, and rewards, then update the policy using the sampled trajectories. While this method is great and solves basic control problems, the algorithm tends to be unstable and is inconsistent in solving an environment. A problem is that as we are updating the policy, the distribution of the inputs and outputs of the approximated policy distribution will change, resulting in instability.